# Packages used throughout: dplyr/tidyr/ggplot2 for wrangling and plots,
# caret for data splitting, tree and randomForest for tree-based models
library(dplyr)
library(tidyr)
library(ggplot2)
library(caret)
library(tree)
library(randomForest)
df <- read.csv("simulated_diabetes.csv")
str(df)
## 'data.frame': 5000 obs. of 18 variables:
## $ age : int 18 53 41 40 83 82 29 19 24 66 ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ smoking_status : chr "Former" "Never" "Former" "Never" ...
## $ physical_activity_minutes_per_week: int 80 66 20 106 45 123 103 63 207 211 ...
## $ diet_score : num 6.3 6.3 9.3 7.8 7.6 5.8 6.2 6 3.7 6.7 ...
## $ sleep_hours_per_day : num 9 7.6 8.5 6.8 5.7 8.9 6.2 5.3 7.3 6.8 ...
## $ screen_time_hours_per_day : num 8.3 8.9 8.9 0.5 5 7.3 4.7 7.3 10.5 5.4 ...
## $ family_history_diabetes : int 0 1 0 0 1 0 0 0 0 1 ...
## $ bmi : num 21.9 25 28.2 29.3 31.1 26.4 32.1 24.6 32.2 27.1 ...
## $ systolic_bp : int 90 127 118 91 140 143 95 110 114 116 ...
## $ diastolic_bp : int 82 65 80 90 85 84 80 59 65 90 ...
## $ heart_rate : int 82 65 85 77 71 61 73 66 69 60 ...
## $ cholesterol_total : int 141 184 211 222 196 173 154 214 180 240 ...
## $ hdl_cholesterol : int 46 73 39 59 31 51 75 72 50 63 ...
## $ ldl_cholesterol : int 64 81 139 142 136 78 50 122 114 153 ...
## $ triglycerides : int 175 169 167 31 172 172 161 59 128 181 ...
## $ hba1c : num 6.85 5.84 5.37 7.36 5.7 8.43 6.53 6.7 6.32 7.35 ...
## $ diagnosed_diabetes : int 1 0 0 1 0 1 1 1 1 1 ...
head(df)
## age gender smoking_status physical_activity_minutes_per_week diet_score
## 1 18 Female Former 80 6.3
## 2 53 Male Never 66 6.3
## 3 41 Male Former 20 9.3
## 4 40 Male Never 106 7.8
## 5 83 Male Never 45 7.6
## 6 82 Female Never 123 5.8
## sleep_hours_per_day screen_time_hours_per_day family_history_diabetes bmi
## 1 9.0 8.3 0 21.9
## 2 7.6 8.9 1 25.0
## 3 8.5 8.9 0 28.2
## 4 6.8 0.5 0 29.3
## 5 5.7 5.0 1 31.1
## 6 8.9 7.3 0 26.4
## systolic_bp diastolic_bp heart_rate cholesterol_total hdl_cholesterol
## 1 90 82 82 141 46
## 2 127 65 65 184 73
## 3 118 80 85 211 39
## 4 91 90 77 222 59
## 5 140 85 71 196 31
## 6 143 84 61 173 51
## ldl_cholesterol triglycerides hba1c diagnosed_diabetes
## 1 64 175 6.85 1
## 2 81 169 5.84 0
## 3 139 167 5.37 0
## 4 142 31 7.36 1
## 5 136 172 5.70 0
## 6 78 172 8.43 1
dim(df)
## [1] 5000 18
# Check for missing values
sum(is.na(df))
## [1] 0
# Check the ranges of continuous variables
ranges <- sapply(df[, sapply(df, is.numeric)],
function(x) c(min = min(x, na.rm = TRUE),
max = max(x, na.rm = TRUE)))
t(ranges)
## min max
## age 18.0 90.00
## physical_activity_minutes_per_week 2.0 651.00
## diet_score 0.0 10.00
## sleep_hours_per_day 3.0 10.00
## screen_time_hours_per_day 0.5 15.00
## family_history_diabetes 0.0 1.00
## bmi 15.0 38.90
## systolic_bp 90.0 167.00
## diastolic_bp 50.0 107.00
## heart_rate 40.0 99.00
## cholesterol_total 100.0 310.00
## hdl_cholesterol 20.0 96.00
## ldl_cholesterol 50.0 263.00
## triglycerides 30.0 278.00
## hba1c 4.0 9.51
## diagnosed_diabetes 0.0 1.00
Comment: The dataset contains 5000 observations with no missing values. All continuous variables fall within expected physiological or behavioral ranges, indicating there are no obvious data entry errors or implausible values. The structure and initial rows confirm that variable types were correctly loaded, and the dataset is ready for preprocessing and modeling.
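The eyeball check of the ranges above can also be made programmatic. The sketch below flags values outside plausibility bounds; the bounds and the `toy` data frame are illustrative assumptions (not clinical cutoffs), and in practice the helper would be run on `df` with one bound pair per variable.

```r
# Count values falling outside a [lo, hi] plausibility window
check_bounds <- function(x, lo, hi) sum(x < lo | x > hi, na.rm = TRUE)

# Toy data with one planted implausible BMI; real use: check_bounds(df$bmi, 10, 60)
toy <- data.frame(bmi   = c(21.9, 25.0, 28.2, 120),
                  hba1c = c(6.85, 5.84, 5.37, 7.36))
check_bounds(toy$bmi, 10, 60)    # 1 flagged value
check_bounds(toy$hba1c, 3, 20)   # 0 flagged values
```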
# Convert gender, smoking_status, family_history_diabetes, and diagnosed_diabetes to factor variables
df <- df %>% mutate(
gender = as.factor(gender),
smoking_status = as.factor(smoking_status),
family_history_diabetes = as.factor(family_history_diabetes),
diagnosed_diabetes = as.factor(diagnosed_diabetes)
)
# Check the factor variables
table(df$gender)
##
## Female Male Other
## 2529 2390 81
table(df$smoking_status)
##
## Current Former Never
## 964 995 3041
table(df$family_history_diabetes)
##
## 0 1
## 3915 1085
table(df$diagnosed_diabetes)
##
## 0 1
## 2064 2936
Comment: String variables were successfully converted to factors, and the resulting category distributions look appropriate. No unexpected or empty categories were detected, so these variables are now properly formatted for modeling. Gender includes a small “Other” category, which may reflect participants who do not identify strictly as male or female and choose to report a nonbinary or alternative gender identity.
summary(df)
## age gender smoking_status
## Min. :18.00 Female:2529 Current: 964
## 1st Qu.:39.00 Male :2390 Former : 995
## Median :50.00 Other : 81 Never :3041
## Mean :49.75
## 3rd Qu.:61.00
## Max. :90.00
## physical_activity_minutes_per_week diet_score sleep_hours_per_day
## Min. : 2.0 Min. : 0.000 Min. : 3.000
## 1st Qu.: 57.0 1st Qu.: 4.700 1st Qu.: 6.300
## Median : 98.0 Median : 6.000 Median : 7.000
## Mean :118.4 Mean : 5.943 Mean : 7.022
## 3rd Qu.:158.0 3rd Qu.: 7.200 3rd Qu.: 7.800
## Max. :651.0 Max. :10.000 Max. :10.000
## screen_time_hours_per_day family_history_diabetes bmi
## Min. : 0.500 0:3915 Min. :15.00
## 1st Qu.: 4.300 1:1085 1st Qu.:23.20
## Median : 5.900 Median :25.65
## Mean : 5.958 Mean :25.63
## 3rd Qu.: 7.600 3rd Qu.:28.10
## Max. :15.000 Max. :38.90
## systolic_bp diastolic_bp heart_rate cholesterol_total
## Min. : 90.0 Min. : 50.00 Min. :40.00 Min. :100.0
## 1st Qu.:105.0 1st Qu.: 70.00 1st Qu.:64.00 1st Qu.:164.0
## Median :116.0 Median : 75.00 Median :70.00 Median :186.0
## Mean :115.7 Mean : 75.32 Mean :69.69 Mean :185.9
## 3rd Qu.:125.0 3rd Qu.: 81.00 3rd Qu.:76.00 3rd Qu.:208.0
## Max. :167.0 Max. :107.00 Max. :99.00 Max. :310.0
## hdl_cholesterol ldl_cholesterol triglycerides hba1c
## Min. :20.00 Min. : 50.0 Min. : 30 Min. :4.000
## 1st Qu.:47.00 1st Qu.: 78.0 1st Qu.: 91 1st Qu.:5.940
## Median :54.00 Median :102.0 Median :121 Median :6.500
## Mean :53.99 Mean :103.1 Mean :121 Mean :6.501
## 3rd Qu.:61.00 3rd Qu.:127.0 3rd Qu.:151 3rd Qu.:7.060
## Max. :96.00 Max. :263.0 Max. :278 Max. :9.510
## diagnosed_diabetes
## 0:2064
## 1:2936
##
##
##
##
# histograms for numeric columns to visually assess distributions
df %>%
select_if(is.numeric) %>%
gather(variable, value) %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 30) +
facet_wrap(~ variable, scales = "free", ncol = 2) +
labs(title = "Distributions of Continuous Variables")
Comment: I checked the continuous variables using summary statistics and visualized their distributions with histograms. No variables appeared to be incorrectly coded as numeric indicators, and the observed ranges were consistent with what would be expected for an adult population. Several variables, such as physical activity minutes per week, screen time hours per day, and triglycerides, showed noticeable right-skewness, which is typical for behavioral and biomarker data. Other variables, including heart rate and blood pressure, displayed more symmetric distributions. Overall, no unusual or implausible values were detected, and the continuous variables appear clean and ready for analysis.
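The skewness judgments above were made visually from the histograms; they can also be quantified. Below is a minimal base-R sketch of the moment coefficient of skewness (using the n − 1 denominator via `sd()`), shown on a small illustrative vector; in practice it would be applied to columns such as `df$triglycerides` or `df$physical_activity_minutes_per_week`.

```r
# Moment coefficient of skewness: mean of standardized values cubed
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skewness(c(1, 2, 3, 4, 100))  # clearly positive: right-skewed toy vector
skewness(1:5)                 # 0: symmetric toy vector
```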
# training+validation = 75% vs. testing = 25%
set.seed(123)
trainval <- createDataPartition(y=df$diagnosed_diabetes, p=0.75, list=FALSE)
dftrval <- df[trainval,]
dftest <- df[-trainval,]
# training = 50% vs. Validation = 25%
set.seed(123)
train <- createDataPartition(y=dftrval$diagnosed_diabetes, p=2/3, list=FALSE)
dftr <- dftrval[train,]
dfval <- dftrval[-train,]
# Verify the data
dim(df); dim(dftrval); dim(dftest); dim(dftr); dim(dfval)
## [1] 5000 18
## [1] 3750 18
## [1] 1250 18
## [1] 2500 18
## [1] 1250 18
# Histogram of numeric variables
hist(df$age)
hist(df$physical_activity_minutes_per_week)
hist(df$diet_score)
hist(df$sleep_hours_per_day)
hist(df$screen_time_hours_per_day)
hist(df$bmi)
hist(df$systolic_bp)
hist(df$diastolic_bp)
hist(df$heart_rate)
hist(df$cholesterol_total)
hist(df$hdl_cholesterol)
hist(df$ldl_cholesterol)
hist(df$triglycerides)
hist(df$hba1c)
# List of numeric variables
num_vars <- c("age", "physical_activity_minutes_per_week", "diet_score",
              "sleep_hours_per_day", "screen_time_hours_per_day", "bmi",
              "systolic_bp", "diastolic_bp", "heart_rate", "cholesterol_total",
              "hdl_cholesterol", "ldl_cholesterol", "triglycerides", "hba1c")
# Correlation among numeric variables
cor_matrix <- cor(df[num_vars], use = "pairwise.complete.obs")
round(cor_matrix[1:6, 1:6], 2)
## age physical_activity_minutes_per_week
## age 1.00 0.00
## physical_activity_minutes_per_week 0.00 1.00
## diet_score 0.00 0.00
## sleep_hours_per_day 0.00 -0.03
## screen_time_hours_per_day 0.01 0.01
## bmi 0.09 -0.07
## diet_score sleep_hours_per_day
## age 0.00 0.00
## physical_activity_minutes_per_week 0.00 -0.03
## diet_score 1.00 0.01
## sleep_hours_per_day 0.01 1.00
## screen_time_hours_per_day -0.01 -0.01
## bmi -0.19 -0.01
## screen_time_hours_per_day bmi
## age 0.01 0.09
## physical_activity_minutes_per_week 0.01 -0.07
## diet_score -0.01 -0.19
## sleep_hours_per_day -0.01 -0.01
## screen_time_hours_per_day 1.00 -0.01
## bmi -0.01 1.00
# Graphical summaries
par(mfrow = c(2, 3))
for (v in num_vars) {
hist(df[[v]], main = v, col = "lightblue", border = "white")
}
# Pairwise scatterplot matrix
pairs(df[, num_vars[1:5]], main = "Pairwise scatterplot matrix")
# Summary of categorical variables
summary(df[, c("gender", "smoking_status", "family_history_diabetes", "diagnosed_diabetes")])
## gender smoking_status family_history_diabetes diagnosed_diabetes
## Female:2529 Current: 964 0:3915 0:2064
## Male :2390 Former : 995 1:1085 1:2936
## Other : 81 Never :3041
cat_vars <- df[, sapply(df, is.character) | sapply(df, is.factor)]
lapply(cat_vars, function(x) {
list(
table = table(x),
proportion = prop.table(table(x))
)
})
## $gender
## $gender$table
## x
## Female Male Other
## 2529 2390 81
##
## $gender$proportion
## x
## Female Male Other
## 0.5058 0.4780 0.0162
##
##
## $smoking_status
## $smoking_status$table
## x
## Current Former Never
## 964 995 3041
##
## $smoking_status$proportion
## x
## Current Former Never
## 0.1928 0.1990 0.6082
##
##
## $family_history_diabetes
## $family_history_diabetes$table
## x
## 0 1
## 3915 1085
##
## $family_history_diabetes$proportion
## x
## 0 1
## 0.783 0.217
##
##
## $diagnosed_diabetes
## $diagnosed_diabetes$table
## x
## 0 1
## 2064 2936
##
## $diagnosed_diabetes$proportion
## x
## 0 1
## 0.4128 0.5872
ggplot(df, aes(x = gender)) +
geom_bar() +
coord_flip() +
theme_minimal() +
labs(title = "Distribution of gender",
x = "gender",
y = "Count")
ggplot(df, aes(x = smoking_status)) +
geom_bar() +
coord_flip() +
theme_minimal() +
labs(title = "Distribution of smoking status",
x = "smoking_status",
y = "Count")
ggplot(df, aes(x = factor(family_history_diabetes))) +
geom_bar() +
theme_minimal() +
labs(title = "Distribution of family history of diabetes",
x = "family_history_diabetes",
y = "Count")
ggplot(df, aes(x = factor(diagnosed_diabetes))) +
geom_bar() +
theme_minimal() +
labs(title = "Distribution of diagnosed diabetes",
x = "diagnosed_diabetes",
y = "Count")
age: Age ranges from 18 to 90, with a median of 50. The distribution is roughly symmetric and centered in mid-adulthood, indicating broad representation across the adult lifespan.
physical_activity_minutes_per_week: Weekly physical activity ranges from 2 to 651 minutes, with a strong right-skew driven by a small number of highly active individuals. Most participants cluster between 50 and 200 minutes per week.
diet_score: Diet score ranges from 0 to 10, with a median near 6. The distribution is approximately symmetric, suggesting a balanced spread of dietary quality levels across the sample.
sleep_hours_per_day: Sleep hours range from 3 to 10, with a median of 7. The distribution is bell-shaped, indicating most individuals report 6–8 hours per night.
screen_time_hours_per_day: Screen time ranges from 0.5 to 15 hours per day and shows right-skewness. While most participants report around 4–7 hours, a smaller group reports much higher use.
bmi: BMI ranges from 15.0 to 38.9, with a median of 25.7. The distribution is moderately symmetric and centered in the mid-20s, consistent with a population spanning normal weight to obesity.
systolic_bp: Systolic blood pressure ranges from 90 to 167, with a median of 116. The distribution is right-skewed, reflecting the presence of individuals with elevated blood pressure.
diastolic_bp: Diastolic blood pressure ranges from 50 to 107, with a median of 75. The distribution is roughly symmetric, centered within the typical normal-to-borderline range.
heart_rate: Heart rate ranges from 40 to 99 beats per minute, with a median of 70. The distribution appears symmetric and aligns with expected resting heart rate values.
cholesterol_total: Total cholesterol ranges from 100 to 310, with a median of 186. The distribution is slightly right-skewed, with most values falling within standard clinical ranges.
hdl_cholesterol: HDL cholesterol ranges from 20 to 96, with a median of 54. Values follow a symmetric distribution centered around typical HDL levels.
ldl_cholesterol: LDL cholesterol ranges from 50 to 263, with a median of 102. Although mostly centered in the normal range, the distribution shows a right tail reflecting individuals with elevated LDL.
triglycerides: Triglycerides range from 30 to 278, showing right-skewness. Most values fall around 100–150, but a subset of participants contributes to a long upper tail.
hba1c: HbA1c ranges from 4.0 to 9.5, with a median of 6.5. The distribution is slightly right-skewed, reflecting a mix of normoglycemic individuals and those with elevated glucose levels.
gender: Gender includes three categories: Female (50.6 percent), Male (47.8 percent), and a small Other group (1.6 percent). The distribution is nearly balanced between males and females, with a very small proportion identifying as Other.
smoking_status: Smoking status consists of Current (19.3 percent), Former (19.9 percent), and Never smokers (60.8 percent). Most participants report never smoking, while current and former smokers are represented in similar proportions.
family_history_diabetes: Family history of diabetes is coded as 0 or 1, with 78.3 percent reporting no family history and 21.7 percent reporting a positive history. This distribution suggests that family history of diabetes is present in a meaningful minority of the sample.
diagnosed_diabetes: Diagnosed diabetes is also binary, with 41.3 percent reporting no diagnosis and 58.7 percent reporting having been diagnosed. The relatively high proportion of diagnosed diabetes reflects the simulated dataset’s design to include a substantial number of individuals with diabetes.
# Build tree (exclude hba1c)
set.seed(123)
diab.tree <- tree(diagnosed_diabetes ~ . - hba1c, data = dftr)
summary(diab.tree)
##
## Classification tree:
## tree(formula = diagnosed_diabetes ~ . - hba1c, data = dftr)
## Variables actually used in tree construction:
## [1] "family_history_diabetes" "age"
## Number of terminal nodes: 3
## Residual mean deviance: 1.301 = 3250 / 2497
## Misclassification error rate: 0.4084 = 1021 / 2500
# Plot tree
plot(diab.tree)
text(diab.tree, pretty = 1, cex = 0.6)
# Predict on validation set
diab.pred <- predict(diab.tree, newdata = dfval, type = "class")
# Validation accuracy
conf <- table(Predicted = diab.pred, Actual = dfval$diagnosed_diabetes)
conf
## Actual
## Predicted 0 1
## 0 351 372
## 1 165 362
val_acc <- sum(diab.pred == dfval$diagnosed_diabetes) / nrow(dfval)
val_acc
## [1] 0.5704
Comment: The final classification tree contained 3 terminal nodes. The first node represents individuals with no family history of diabetes and age less than 60.5 years, and this node predicts no diagnosed diabetes (class 0). The second node contains individuals with no family history but age 60.5 years or older, and it predicts diagnosed diabetes (class 1). The third node includes all individuals with a positive family history of diabetes, regardless of age, and also predicts diagnosed diabetes (class 1). When applied to the validation set, the tree achieved a classification accuracy of 0.5704, corresponding to approximately 57 percent correctly classified observations.
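Accuracy alone hides how the errors split across the two classes. As a sketch (hardcoding the confusion matrix printed above), sensitivity and specificity can be recovered directly from the table:

```r
# Rebuild the printed validation confusion matrix (rows = predicted, cols = actual)
conf <- matrix(c(351, 165, 372, 362), nrow = 2,
               dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))

accuracy    <- sum(diag(conf)) / sum(conf)       # 0.5704, matching val_acc
sensitivity <- conf["1", "1"] / sum(conf[, "1"]) # recall for class 1: ~0.493
specificity <- conf["0", "0"] / sum(conf[, "0"]) # ~0.680
```

The tree is notably better at identifying non-cases than cases here: roughly 68 percent of true non-diabetics are classified correctly, versus about 49 percent of true diabetics.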
set.seed(123)
# Random forest with mtry = 4
rf4 <- randomForest(
diagnosed_diabetes ~ . - hba1c,
data = dftr,
mtry = 4,
importance = TRUE
)
# Predict validation set
pred4 <- predict(rf4, newdata = dfval, type = "class")
acc4 <- mean(pred4 == dfval$diagnosed_diabetes)
acc4
## [1] 0.596
# Random forest with mtry = 16
rf16 <- randomForest(
diagnosed_diabetes ~ . - hba1c,
data = dftr,
mtry = 16,
importance = TRUE
)
# Predict validation set
pred16 <- predict(rf16, newdata = dfval, type = "class")
acc16 <- mean(pred16 == dfval$diagnosed_diabetes)
acc16
## [1] 0.592
# Variable importance plots
varImpPlot(rf4, main = "Variable Importance (mtry = 4)")
varImpPlot(rf16, main = "Variable Importance (mtry = 16)")
# Numeric importance values
importance(rf4)
## 0 1 MeanDecreaseAccuracy
## age 2.9106674 10.01016800 10.4292845
## gender -0.5101661 -0.05574139 -0.4060277
## smoking_status -2.5353483 -1.41231167 -2.6761152
## physical_activity_minutes_per_week 1.8022128 5.23803107 4.8724241
## diet_score 2.9675872 -1.44963830 0.9768614
## sleep_hours_per_day 1.8914093 -1.85860346 -0.1477519
## screen_time_hours_per_day -1.1728497 -2.18171737 -2.4716202
## family_history_diabetes 23.1306787 15.27501332 25.6431233
## bmi 1.7178127 6.86758391 6.3547409
## systolic_bp 3.4051658 2.81477710 4.5092053
## diastolic_bp 2.4917038 -2.21383496 -0.2018646
## heart_rate -2.1117244 2.41196324 0.4922880
## cholesterol_total 1.1021219 4.92255555 5.0727158
## hdl_cholesterol 4.7458061 2.56333285 5.1525057
## ldl_cholesterol -0.8124204 2.17085413 1.3052503
## triglycerides 0.3497661 5.93905969 4.7989056
## MeanDecreaseGini
## age 98.99092
## gender 15.86020
## smoking_status 24.25933
## physical_activity_minutes_per_week 101.41047
## diet_score 86.28352
## sleep_hours_per_day 82.14977
## screen_time_hours_per_day 87.99840
## family_history_diabetes 40.34213
## bmi 97.66736
## systolic_bp 86.24187
## diastolic_bp 75.02284
## heart_rate 75.66762
## cholesterol_total 84.65377
## hdl_cholesterol 84.09817
## ldl_cholesterol 76.68814
## triglycerides 93.80277
importance(rf16)
## 0 1 MeanDecreaseAccuracy
## age 1.7520901 13.92104130 11.986278216
## gender 0.6973503 0.73510727 1.007488233
## smoking_status -1.1983922 2.83844642 1.493587355
## physical_activity_minutes_per_week 3.4126135 5.99460392 6.719832534
## diet_score -0.2619317 -0.57290168 -0.674957426
## sleep_hours_per_day 3.2977540 -2.65110246 0.216079738
## screen_time_hours_per_day -3.2127876 -1.95705130 -3.657706695
## family_history_diabetes 26.2562005 16.63355898 28.485102350
## bmi 3.1482750 7.15717096 7.672483461
## systolic_bp 2.2698859 5.39478094 5.787445505
## diastolic_bp 1.5638384 -0.04137306 1.027256148
## heart_rate -2.2026319 2.05696996 -0.004237676
## cholesterol_total 0.2437571 6.36585151 6.090996614
## hdl_cholesterol 0.4943518 3.88314897 3.487239232
## ldl_cholesterol -3.5861710 4.59114302 1.688794765
## triglycerides -2.6978055 5.42416641 2.268822911
## MeanDecreaseGini
## age 102.42393
## gender 12.59000
## smoking_status 20.13815
## physical_activity_minutes_per_week 107.43660
## diet_score 88.43551
## sleep_hours_per_day 83.88539
## screen_time_hours_per_day 88.89550
## family_history_diabetes 42.73190
## bmi 103.09155
## systolic_bp 83.94659
## diastolic_bp 75.35478
## heart_rate 76.08679
## cholesterol_total 79.08438
## hdl_cholesterol 86.30698
## ldl_cholesterol 65.64828
## triglycerides 95.39584
Comment: Between the two models, mtry = 4 produced the higher validation accuracy (0.596), slightly outperforming mtry = 16 (0.592), so mtry = 4 provides the better predictive performance here. Across both models, the most influential predictors were family history, metabolic, and lifestyle variables: family_history_diabetes, physical_activity_minutes_per_week, bmi, age, triglycerides, screen_time_hours_per_day, diet_score, systolic_bp, hdl_cholesterol, and cholesterol_total consistently showed high importance on both MeanDecreaseAccuracy and MeanDecreaseGini. The paired importance plots confirm that diabetes diagnosis in this dataset is largely driven by a combination of metabolic health factors and behavioral patterns.
## Full logistic regression (training set)
glm_full <- glm(
diagnosed_diabetes ~ . - hba1c,
data = dftr,
family = binomial
)
summary(glm_full)
##
## Call:
## glm(formula = diagnosed_diabetes ~ . - hba1c, family = binomial,
## data = dftr)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.9245329 0.8510000 -2.261 0.0237 *
## age 0.0175330 0.0034249 5.119 3.07e-07 ***
## genderMale 0.1087732 0.0858735 1.267 0.2053
## genderOther 0.2413348 0.3451864 0.699 0.4845
## smoking_statusFormer -0.0302847 0.1385412 -0.219 0.8270
## smoking_statusNever -0.0589888 0.1140289 -0.517 0.6049
## physical_activity_minutes_per_week -0.0022916 0.0005229 -4.383 1.17e-05 ***
## diet_score -0.0606903 0.0240223 -2.526 0.0115 *
## sleep_hours_per_day 0.0145636 0.0386959 0.376 0.7066
## screen_time_hours_per_day 0.0230782 0.0175071 1.318 0.1874
## family_history_diabetes1 1.0847880 0.1170133 9.271 < 2e-16 ***
## bmi 0.0278031 0.0137038 2.029 0.0425 *
## systolic_bp 0.0074614 0.0036527 2.043 0.0411 *
## diastolic_bp 0.0008221 0.0053701 0.153 0.8783
## heart_rate 0.0049275 0.0052336 0.942 0.3464
## cholesterol_total 0.0009862 0.0040608 0.243 0.8081
## hdl_cholesterol -0.0103927 0.0057682 -1.802 0.0716 .
## ldl_cholesterol -0.0013528 0.0040800 -0.332 0.7402
## triglycerides 0.0011692 0.0010700 1.093 0.2745
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3389.3 on 2499 degrees of freedom
## Residual deviance: 3162.4 on 2481 degrees of freedom
## AIC: 3200.4
##
## Number of Fisher Scoring iterations: 4
## Extract p values (excluding intercept)
coef_tab <- coef(summary(glm_full))
pvals <- coef_tab[-1, "Pr(>|z|)"] # drop intercept
varnames <- names(pvals)
m <- length(pvals) # number of tests
m
## [1] 18
pvals
## age genderMale
## 3.066612e-07 2.052741e-01
## genderOther smoking_statusFormer
## 4.844624e-01 8.269637e-01
## smoking_statusNever physical_activity_minutes_per_week
## 6.049365e-01 1.172132e-05
## diet_score sleep_hours_per_day
## 1.152345e-02 7.066496e-01
## screen_time_hours_per_day family_history_diabetes1
## 1.874306e-01 1.850418e-20
## bmi systolic_bp
## 4.247197e-02 4.108104e-02
## diastolic_bp heart_rate
## 8.783288e-01 3.464420e-01
## cholesterol_total hdl_cholesterol
## 8.081176e-01 7.158670e-02
## ldl_cholesterol triglycerides
## 7.402129e-01 2.744824e-01
alpha <- 0.05
sig_unadj <- varnames[pvals < alpha]
sig_unadj
## [1] "age" "physical_activity_minutes_per_week"
## [3] "diet_score" "family_history_diabetes1"
## [5] "bmi" "systolic_bp"
Comment: At the unadjusted α = 0.05 level, the significant predictors were age, physical_activity_minutes_per_week, diet_score, family_history_diabetes, bmi, and systolic_bp.
alpha_bonf <- alpha / m # Bonferroni threshold
alpha_bonf
## [1] 0.002777778
sig_bonf <- varnames[pvals < alpha_bonf]
sig_bonf
## [1] "age" "physical_activity_minutes_per_week"
## [3] "family_history_diabetes1"
Comment: Using Bonferroni’s method to control the family-wise error rate, the significance threshold becomes α/m = 0.05/18 ≈ 0.00278. Under this more conservative cutoff, only three variables remained statistically significant: age, physical_activity_minutes_per_week, and family_history_diabetes. These predictors had p-values below the Bonferroni-adjusted threshold and therefore showed strong evidence of association with diagnosed diabetes even after controlling for multiple testing.
# Order p values
ord <- order(pvals)
p_sorted <- pvals[ord]
vars_sorted <- varnames[ord]
# Holm critical values: alpha/(m - k + 1)
k_idx <- seq_along(p_sorted)
crit_H <- alpha / (m - k_idx + 1)
# Find largest k with p_(k) <= crit_H(k)
rej_idx <- which(p_sorted <= crit_H)
if (length(rej_idx) == 0) {
sig_holm <- character(0)
} else {
k_max <- max(rej_idx)
sig_holm <- vars_sorted[1:k_max]
}
crit_H
## [1] 0.002777778 0.002941176 0.003125000 0.003333333 0.003571429 0.003846154
## [7] 0.004166667 0.004545455 0.005000000 0.005555556 0.006250000 0.007142857
## [13] 0.008333333 0.010000000 0.012500000 0.016666667 0.025000000 0.050000000
sig_holm
## [1] "family_history_diabetes1" "age"
## [3] "physical_activity_minutes_per_week"
Comment: Using Holm’s step-down procedure to control the family-wise error rate, the p-values were first sorted and compared to the sequential Holm critical values α/(m − k + 1). Three variables met the Holm rejection criteria: age, physical_activity_minutes_per_week, and family_history_diabetes. The Holm-adjusted set was identical to the Bonferroni-adjusted set in this dataset. These variables remained significant even after applying Holm’s more powerful FWER correction, indicating strong evidence of association with diagnosed diabetes.
q <- 0.05
k_idx <- seq_along(p_sorted)
crit_BH <- (k_idx / m) * q
rej_idx_BH <- which(p_sorted <= crit_BH)
if (length(rej_idx_BH) == 0) {
sig_fdr <- character(0)
} else {
k_max_BH <- max(rej_idx_BH)
sig_fdr <- vars_sorted[1:k_max_BH]
}
crit_BH
## [1] 0.002777778 0.005555556 0.008333333 0.011111111 0.013888889 0.016666667
## [7] 0.019444444 0.022222222 0.025000000 0.027777778 0.030555556 0.033333333
## [13] 0.036111111 0.038888889 0.041666667 0.044444444 0.047222222 0.050000000
sig_fdr
## [1] "family_history_diabetes1" "age"
## [3] "physical_activity_minutes_per_week"
Comment: Applying the Benjamini–Hochberg procedure to control the false discovery rate at q = 0.05, the p-values were ordered and compared to the BH critical values (k/m)q. Three variables met the BH rejection criterion and were therefore significant after controlling the FDR: age, physical_activity_minutes_per_week, and family_history_diabetes. As with Holm and Bonferroni, the same three predictors were retained under FDR correction. These variables had sufficiently small p-values relative to their BH thresholds, indicating that they remain statistically significant even when controlling the expected proportion of false discoveries.
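The three manual procedures can be cross-checked with base R's `p.adjust()`, which implements Bonferroni, Holm, and BH as adjusted p-values compared against a single cutoff. Below is a sketch using the six unadjusted-significant p-values reported above, with `n = 18` so each correction still accounts for all tests that were fitted:

```r
# p-values copied from the glm_full summary (rounded); n = 18 total tests
pv <- c(family_history = 1.85e-20, age = 3.07e-07, activity = 1.17e-05,
        diet = 1.15e-02, systolic = 4.11e-02, bmi = 4.25e-02)

sum(p.adjust(pv, method = "bonferroni", n = 18) < 0.05)  # 3
sum(p.adjust(pv, method = "holm",       n = 18) < 0.05)  # 3
sum(p.adjust(pv, method = "BH",         n = 18) < 0.05)  # 3
```

Because `p.adjust()` returns adjusted p-values rather than adjusted thresholds, a single comparison against 0.05 reproduces the same three-variable set found by all three manual procedures.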
# a) Unadjusted 0.05
vars_unadj <- c(
"age",
"diet_score",
"bmi",
"physical_activity_minutes_per_week",
"family_history_diabetes",
"systolic_bp"
)
# b) Bonferroni
vars_bonf <- c(
"family_history_diabetes",
"physical_activity_minutes_per_week",
"age"
)
# c) Holm
vars_holm <- c(
"family_history_diabetes",
"physical_activity_minutes_per_week",
"age"
)
# d) FDR
vars_fdr <- c(
"family_history_diabetes",
"physical_activity_minutes_per_week",
"age"
)
# Validation error function
logit_val_error <- function(vars, dftr, dfval, threshold = 0.5) {
if (length(vars) == 0) {
form <- diagnosed_diabetes ~ 1
} else {
form <- as.formula(
paste("diagnosed_diabetes ~", paste(vars, collapse = " + "))
)
}
fit <- glm(form, data = dftr, family = binomial)
p_hat <- predict(fit, newdata = dfval, type = "response")
y_hat <- ifelse(p_hat >= threshold, 1, 0)
y_true <- ifelse(dfval$diagnosed_diabetes %in% c(1, "1"), 1, 0)
acc <- mean(y_hat == y_true)
err <- 1 - acc
list(model = fit, accuracy = acc, error = err)
}
# 1) unadjusted model
res_unadj <- logit_val_error(vars_unadj, dftr, dfval)
res_unadj$accuracy
## [1] 0.6208
res_unadj$error
## [1] 0.3792
# 2) Bonferroni model
res_bonf <- logit_val_error(vars_bonf, dftr, dfval)
res_bonf$accuracy
## [1] 0.6064
res_bonf$error
## [1] 0.3936
# 3) Holm model
res_holm <- logit_val_error(vars_holm, dftr, dfval)
res_holm$accuracy
## [1] 0.6064
res_holm$error
## [1] 0.3936
# 4) FDR model
res_fdr <- logit_val_error(vars_fdr, dftr, dfval)
res_fdr$accuracy
## [1] 0.6064
res_fdr$error
## [1] 0.3936
Comment: The unadjusted model at the 0.05 level retained six variables (age, diet_score, bmi, physical_activity_minutes_per_week, family_history_diabetes, and systolic_bp), and this model achieved the highest validation accuracy at 0.6208, corresponding to a validation error of 0.3792. When applying Bonferroni’s family-wise error correction, only three variables (age, physical_activity_minutes_per_week, and family_history_diabetes) remained significant, and the reduced model produced a slightly lower validation accuracy of 0.6064, with an error rate of 0.3936. Holm’s method selected the same three predictors as Bonferroni, leading to identical validation performance with an accuracy of 0.6064 and an error of 0.3936. Lastly, controlling the false discovery rate using the Benjamini–Hochberg procedure at q = 0.05 also resulted in the same three-variable model and the same validation metrics. Overall, the model without multiple-testing adjustments provided the best predictive accuracy, while all adjusted models showed slightly higher error rates due to the more restrictive selection of predictors.
If the goal is strictly prediction, the model that performs best on the validation set should be chosen, regardless of interpretability or variable selection simplicity. Across all models evaluated, the unadjusted logistic regression achieved the highest validation accuracy (0.6208), outperforming the decision tree, both random forest models, and all of the multiple-testing–adjusted logistic regressions. The random forest models produced similar but slightly lower predictive accuracy, and the decision tree performed worst due to its simplicity and limited structure. Therefore, based on predictive performance alone, I would choose the unadjusted logistic regression model. It uses a moderately sized set of predictors and provides the strongest out-of-sample prediction for diagnosed_diabetes among the approaches considered.
If the aim is to understand the relationships between predictors and diagnosed diabetes (rather than maximizing predictive accuracy), interpretability, statistical rigor, and control of false positives should be prioritized. In this case, a logistic regression model with multiple-testing correction is most appropriate. Both the Bonferroni and Holm procedures retained the same three variables (age, physical_activity_minutes_per_week, and family_history_diabetes) and these variables also remained significant under FDR control. This consistency strengthens confidence that these predictors are robustly associated with diagnosed diabetes, even after accounting for multiple comparisons. Although the reduced models do not predict as well as the unadjusted logistic model, they provide cleaner inference by limiting spurious findings. Thus, for the purpose of understanding covariate relationships, I would choose the logistic regression model using the variables that survive Holm or FDR correction, as it balances interpretability with appropriate error control.
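As a recap, the validation accuracies reported throughout this section can be collected into a single table (values hardcoded from the outputs above) to make the model comparison explicit:

```r
# Validation-set accuracies copied from earlier output in this report
results <- data.frame(
  model    = c("classification tree", "random forest (mtry = 4)",
               "random forest (mtry = 16)", "logistic (unadjusted)",
               "logistic (Bonferroni/Holm/FDR)"),
  accuracy = c(0.5704, 0.596, 0.592, 0.6208, 0.6064)
)
results$error <- 1 - results$accuracy
results[order(-results$accuracy), ]              # ranked comparison
results$model[which.max(results$accuracy)]       # "logistic (unadjusted)"
```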
## LE8 variables
le8vars <- c(
  "diet_score",
  "physical_activity_minutes_per_week",
  "smoking_status",
  "sleep_hours_per_day",
  "bmi",
  "cholesterol_total",
  "hdl_cholesterol",
  "hba1c",
  "systolic_bp",
  "diastolic_bp"
)
df_le8 <- df[, le8vars]
## model.matrix creates treatment-coded (dummy) indicator columns for factors
X_le8 <- model.matrix(
  ~ diet_score +
    physical_activity_minutes_per_week +
    smoking_status +
    sleep_hours_per_day +
    bmi +
    cholesterol_total +
    hdl_cholesterol +
    hba1c +
    systolic_bp +
    diastolic_bp,
  data = df_le8
)[, -1] # remove intercept column
## PCA with centering and scaling
le8.pca <- prcomp(X_le8, center = TRUE, scale. = TRUE)
## Eigenvalues and cumulative variance explained
le8_eig <- le8.pca$sdev^2          # eigenvalues (component variances)
le8_pve <- le8_eig / sum(le8_eig)  # proportion of variance explained
le8_cum_pve <- cumsum(le8_pve)     # cumulative proportion
le8_pve
## [1] 0.15625009 0.14614162 0.09990181 0.09422061 0.08979292 0.08679881
## [7] 0.08187005 0.08057559 0.07178823 0.05890962 0.03375065
le8_cum_pve
## [1] 0.1562501 0.3023917 0.4022935 0.4965141 0.5863071 0.6731059 0.7549759
## [8] 0.8355515 0.9073397 0.9662493 1.0000000
## Minimum number of PCs to exceed 75% cumulative variance
which(le8_cum_pve >= 0.75)[1]
## [1] 7
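As a sanity check on the manual variance calculation, summary() of a prcomp object tabulates the same proportion and cumulative-proportion rows in its importance table (a toy matrix here, not the LE8 data):

```r
## Toy check: manual PVE matches summary(prcomp)$importance
set.seed(1)
toy <- matrix(rnorm(200), ncol = 4)
p <- prcomp(toy, center = TRUE, scale. = TRUE)
pve <- p$sdev^2 / sum(p$sdev^2)
summary(p)$importance["Cumulative Proportion", ]  # same values as cumsum(pve)
```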
## Scree plot
par(mfrow = c(1, 2))
## Individual variance explained
plot(
  le8_pve,
  type = "b",
  xlab = "# of Components",
  ylab = "Proportion of Variance Explained",
  main = "Scree Plot"
)
## Cumulative variance explained
plot(
  le8_cum_pve,
  type = "b",
  ylim = c(0, 1),
  xlab = "# of Components",
  ylab = "Cumulative Proportion Explained",
  main = "Cumulative Variance Explained"
)
abline(h = 0.75, lty = 2)
par(mfrow = c(1, 1))
## Full loadings
le8.pca$rotation
## PC1 PC2 PC3
## diet_score -0.219997571 -0.14437531 -0.047774867
## physical_activity_minutes_per_week -0.122875936 -0.01992516 -0.606440496
## smoking_statusFormer 0.259750879 -0.65628548 -0.017108451
## smoking_statusNever -0.289666467 0.64299188 -0.008377308
## sleep_hours_per_day 0.002983006 -0.01513587 0.347210872
## bmi 0.507718867 0.21718814 0.011811332
## cholesterol_total 0.349584735 0.14189226 -0.322501416
## hdl_cholesterol -0.273310582 -0.09777385 -0.325974862
## hba1c 0.264492952 0.13789308 0.415735982
## systolic_bp 0.411923102 0.17055129 -0.231705159
## diastolic_bp 0.301318306 0.09702781 -0.268513035
## PC4 PC5 PC6
## diet_score -0.628088406 -0.002885865 0.57230478
## physical_activity_minutes_per_week 0.302153182 -0.341226858 0.21217516
## smoking_statusFormer -0.005710473 0.021615200 -0.06412873
## smoking_statusNever -0.060988183 0.003276334 -0.04250832
## sleep_hours_per_day -0.262182489 -0.850195266 -0.27190115
## bmi 0.203346084 -0.022774346 -0.01320100
## cholesterol_total -0.240187619 -0.168956872 0.13380963
## hdl_cholesterol -0.399408371 0.221553707 -0.59503867
## hba1c -0.217921375 0.282851830 0.06010799
## systolic_bp -0.315225089 -0.004901813 0.13696510
## diastolic_bp -0.190507828 0.045449273 -0.39124996
## PC7 PC8 PC9
## diet_score 0.304383247 0.08445555 -0.04812606
## physical_activity_minutes_per_week 0.034945347 -0.58012593 0.05070471
## smoking_statusFormer -0.053913946 -0.02450954 -0.01196067
## smoking_statusNever 0.001763677 0.02956981 -0.01687756
## sleep_hours_per_day -0.015238898 -0.09382845 -0.05153837
## bmi 0.032183307 0.16650212 -0.23267613
## cholesterol_total -0.395637588 0.29675257 0.63325993
## hdl_cholesterol -0.331608418 -0.09928898 -0.08748093
## hba1c -0.096623024 -0.70791549 0.29567385
## systolic_bp -0.211337474 -0.13452461 -0.64118520
## diastolic_bp 0.762534634 -0.03402568 0.17515159
## PC10 PC11
## diet_score 0.32301295 0.001696639
## physical_activity_minutes_per_week 0.15266604 -0.018402305
## smoking_statusFormer 0.04286789 -0.701017602
## smoking_statusNever 0.03740983 -0.703196641
## sleep_hours_per_day 0.04209775 0.015501932
## bmi 0.75454504 0.025172997
## cholesterol_total -0.06220125 -0.005484053
## hdl_cholesterol 0.32636874 0.113163421
## hba1c 0.09576437 -0.003245414
## systolic_bp -0.39433662 -0.003712171
## diastolic_bp -0.15149387 -0.003597924
## Loadings for PC1 and PC2
round(le8.pca$rotation[, 1:2], 3)
## PC1 PC2
## diet_score -0.220 -0.144
## physical_activity_minutes_per_week -0.123 -0.020
## smoking_statusFormer 0.260 -0.656
## smoking_statusNever -0.290 0.643
## sleep_hours_per_day 0.003 -0.015
## bmi 0.508 0.217
## cholesterol_total 0.350 0.142
## hdl_cholesterol -0.273 -0.098
## hba1c 0.264 0.138
## systolic_bp 0.412 0.171
## diastolic_bp 0.301 0.097
## Helper function for effect-size categories
effect_cat <- function(x) {
  ax <- abs(x)
  ## case_when() comes from dplyr, which is loaded earlier in the script
  case_when(
    ax < 0.2 ~ "minimal",
    ax < 0.3 ~ "small",
    ax < 0.4 ~ "moderate",
    TRUE ~ "large"
  )
}
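If avoiding the dplyr dependency were desirable, an equivalent categorization can be written with base R's cut(); this sketch reproduces the same thresholds on the absolute loading:

```r
## Base-R equivalent of effect_cat() using cut() on |loading|;
## right = FALSE makes the intervals [0, 0.2), [0.2, 0.3), [0.3, 0.4), [0.4, Inf)
effect_cat_base <- function(x) {
  cut(abs(x),
      breaks = c(-Inf, 0.2, 0.3, 0.4, Inf),
      labels = c("minimal", "small", "moderate", "large"),
      right = FALSE)
}
as.character(effect_cat_base(c(0.05, -0.25, 0.35, 0.51)))
```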
## Table for interpreting PC1 and PC2
pc12_loadings <- as.data.frame(le8.pca$rotation[, 1:2])
pc12_loadings$PC1_effect <- effect_cat(pc12_loadings$PC1)
pc12_loadings$PC2_effect <- effect_cat(pc12_loadings$PC2)
pc12_loadings
## PC1 PC2 PC1_effect
## diet_score -0.219997571 -0.14437531 small
## physical_activity_minutes_per_week -0.122875936 -0.01992516 minimal
## smoking_statusFormer 0.259750879 -0.65628548 small
## smoking_statusNever -0.289666467 0.64299188 small
## sleep_hours_per_day 0.002983006 -0.01513587 minimal
## bmi 0.507718867 0.21718814 large
## cholesterol_total 0.349584735 0.14189226 moderate
## hdl_cholesterol -0.273310582 -0.09777385 small
## hba1c 0.264492952 0.13789308 small
## systolic_bp 0.411923102 0.17055129 large
## diastolic_bp 0.301318306 0.09702781 moderate
## PC2_effect
## diet_score minimal
## physical_activity_minutes_per_week minimal
## smoking_statusFormer large
## smoking_statusNever large
## sleep_hours_per_day minimal
## bmi small
## cholesterol_total minimal
## hdl_cholesterol minimal
## hba1c minimal
## systolic_bp minimal
## diastolic_bp minimal
biplot(le8.pca, scale = 0,
       xlab = "PC1",
       ylab = "PC2",
       main = "Biplot of LE8 PCA")
Comment: The PCA was conducted with centering and scaling so that all variables contributed on a comparable scale. Examination of the proportion of variance explained by each component showed that the first few components captured a moderate share of the variation (PC1 = 15.6%, PC2 = 14.6%, PC3 = 10.0%, PC4 = 9.4%). The cumulative variance explained increased steadily, reaching approximately 75.5% after the first seven components. Therefore, seven principal components are required to retain at least 75 percent of the total variation in the LE8 variables. The scree plot and cumulative variance plot both show this point clearly.
PC1 is primarily defined by the variables related to BMI, cholesterol control, and blood pressure, with several large or moderate loadings across these groups. BMI shows a large positive loading (0.508), indicating that higher BMI strongly increases PC1 scores. Systolic blood pressure (0.412) also contributes a large positive effect, and total cholesterol (0.350) and diastolic blood pressure (0.301) contribute moderate positive effects. HDL cholesterol shows a small negative loading (-0.273), suggesting lower HDL increases PC1. Factors representing tobacco exposure (smoking_statusFormer = 0.260; smoking_statusNever = -0.290) show small contributions but in opposite directions. Diet score (-0.220) and physical activity (-0.123) have small-to-minimal negative effects, and sleep and blood sugar measures (sleep_hours_per_day and hba1c) have minimal influence. Overall, PC1 represents a general “cardiometabolic risk” dimension, with higher PC1 scores reflecting higher BMI, higher blood pressure, and higher cholesterol levels, along with modest influence from smoking history.
PC2 is defined most strongly by the smoking-related variables. Both tobacco indicators show large contributions in opposite directions: smoking_statusFormer loads negatively (-0.657) and smoking_statusNever loads positively (0.643). This makes smoking behavior the dominant factor separating individuals along the PC2 axis. Diet score (-0.144), physical activity (-0.020), hba1c (0.138), and systolic/diastolic blood pressure have minimal influence. BMI (0.217) and total cholesterol (0.142) contribute small-to-minimal amounts. Thus, PC2 primarily represents a “smoking behavior” axis, where higher PC2 values correspond to never-smokers and lower values correspond to former smokers.
To summarize:
PC1 = a cardiometabolic risk profile, dominated by BMI, blood pressure, and cholesterol, with moderate influence from smoking category; high PC1 scores reflect unfavorable metabolic and cardiovascular measures.
PC2 = a smoking-status axis, driven almost entirely by the two smoking indicator variables, distinguishing never-smokers from former smokers, with minimal contribution from the other LE8 components.
The biplot displays three visible clusters because the PCA includes the smoking_status factor, which was converted into two indicator variables and implicitly represents three categories: never smokers, former smokers, and current smokers. These smoking indicators have the largest loadings on PC2, causing the three smoking groups to separate distinctly along that axis, while PC1 captures variation in cardiometabolic risk through BMI, blood pressure, and cholesterol levels, creating additional spread within each smoking category. As a result, the combination of a strongly separating categorical variable (smoking status) and continuous variation in metabolic health produces three natural clusters in the biplot.
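The three-cluster pattern can be traced directly to the dummy coding: under R's default treatment contrasts, a three-level factor becomes two 0/1 indicator columns, so every observation sits at one of exactly three points in that sub-space. A minimal sketch with a standalone factor:

```r
## A 3-level factor -> two 0/1 columns under treatment contrasts;
## only three distinct rows are possible: (0,0), (1,0), (0,1)
smk <- factor(c("Current", "Former", "Never", "Former", "Current"))
dummies <- model.matrix(~ smk)[, -1]  # drop the intercept column
unique(dummies)
```

After scaling, these three points remain well separated, which is why the smoking indicators dominate one PC axis and split the scores into three bands.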
## Use the smallest number of PCs that explain at least 75 percent
pc_keep <- which(le8_cum_pve >= 0.75)[1]
pc_keep
## [1] 7
## PCA scores for the selected components
pc_scores <- le8.pca$x[, 1:pc_keep]
## Distance matrix
dist_pc <- dist(pc_scores)
## Hierarchical clustering with three linkage methods
hc_complete <- hclust(dist_pc, method = "complete")
hc_average <- hclust(dist_pc, method = "average")
hc_single <- hclust(dist_pc, method = "single")
## Plot dendrograms for each linkage
par(mfrow = c(1, 3))
plot(hc_complete,
     main = "Hierarchical Clustering (Complete Linkage)",
     xlab = "", sub = "", cex = 0.6)
plot(hc_average,
     main = "Hierarchical Clustering (Average Linkage)",
     xlab = "", sub = "", cex = 0.6)
plot(hc_single,
     main = "Hierarchical Clustering (Single Linkage)",
     xlab = "", sub = "", cex = 0.6)
par(mfrow = c(1, 1))
## Cut each dendrogram into 3 clusters and get cluster sizes
clust_complete3 <- cutree(hc_complete, k = 3)
clust_average3 <- cutree(hc_average, k = 3)
clust_single3 <- cutree(hc_single, k = 3)
table(clust_complete3)
## clust_complete3
## 1 2 3
## 792 1964 2244
table(clust_average3)
## clust_average3
## 1 2 3
## 4997 2 1
table(clust_single3)
## clust_single3
## 1 2 3
## 4998 1 1
## Add complete-linkage cluster membership to the full dataset
df$cluster_complete3 <- factor(clust_complete3)
## Compute means of all numeric variables by cluster
cluster_means <- df %>%
  group_by(cluster_complete3) %>%
  summarise(
    across(
      where(is.numeric),
      ~ mean(.x, na.rm = TRUE)
    )
  )
df <- df %>%
  mutate(
    smoke_former = ifelse(smoking_status == "Former", 1, 0),
    smoke_never  = ifelse(smoking_status == "Never", 1, 0)
  )
cluster_means_le8 <- df %>%
  group_by(cluster_complete3) %>%
  summarise(
    across(
      c(
        "diet_score",
        "physical_activity_minutes_per_week",
        "sleep_hours_per_day",
        "bmi",
        "cholesterol_total",
        "hdl_cholesterol",
        "hba1c",
        "systolic_bp",
        "diastolic_bp",
        "smoke_former",
        "smoke_never",
        "diagnosed_diabetes"
      ),
      ~ mean(.x, na.rm = TRUE)
    )
  )
## Warning: There were 3 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `across(...)`.
## ℹ In group 1: `cluster_complete3 = 1`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.
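This warning (and the NA means for diagnosed_diabetes in the table that follows) occurs because diagnosed_diabetes is a factor at this point in the script, and mean() of a factor returns NA. A minimal reproduction, with the usual fix of converting through the labels:

```r
## mean() on a factor warns and returns NA; convert via the labels first
f <- factor(c(0, 1, 1, 0, 1))
suppressWarnings(mean(f))           # NA
mean(as.numeric(as.character(f)))   # 0.6
```

Note that as.numeric(f) alone would return the internal level codes (1, 2) rather than the 0/1 labels, so the as.character() step matters.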
print(cluster_means, width = Inf)
## # A tibble: 3 × 15
## cluster_complete3 age physical_activity_minutes_per_week diet_score
## <fct> <dbl> <dbl> <dbl>
## 1 1 51.9 179. 6.18
## 2 2 53.3 91.1 5.69
## 3 3 45.9 121. 6.08
## sleep_hours_per_day screen_time_hours_per_day bmi systolic_bp diastolic_bp
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.29 5.91 26.6 120. 75.0
## 2 6.78 6.03 27.0 120. 77.2
## 3 7.14 5.92 24.1 110. 73.8
## heart_rate cholesterol_total hdl_cholesterol ldl_cholesterol triglycerides
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 68.7 197. 48.1 119. 126.
## 2 70.3 191. 51.9 110. 129.
## 3 69.5 177. 57.9 91.4 112.
## hba1c
## <dbl>
## 1 6.34
## 2 6.82
## 3 6.28
print(cluster_means_le8, width = Inf)
## # A tibble: 3 × 13
## cluster_complete3 diet_score physical_activity_minutes_per_week
## <fct> <dbl> <dbl>
## 1 1 6.18 179.
## 2 2 5.69 91.1
## 3 3 6.08 121.
## sleep_hours_per_day bmi cholesterol_total hdl_cholesterol hba1c systolic_bp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.29 26.6 197. 48.1 6.34 120.
## 2 6.78 27.0 191. 51.9 6.82 120.
## 3 7.14 24.1 177. 57.9 6.28 110.
## diastolic_bp smoke_former smoke_never diagnosed_diabetes
## <dbl> <dbl> <dbl> <dbl>
## 1 75.0 0.174 0.535 NA
## 2 77.2 0.254 0.526 NA
## 3 73.8 0.160 0.705 NA
## Ensure diagnosed_diabetes is coded as 0/1 or factor with two levels
## For safety, convert to factor here:
df$diagnosed_diabetes <- as.factor(df$diagnosed_diabetes)
## Fit logistic regression with cluster membership as predictor
glm_cluster <- glm(
  diagnosed_diabetes ~ cluster_complete3,
  data = df,
  family = binomial
)
summary(glm_cluster)
##
## Call:
## glm(formula = diagnosed_diabetes ~ cluster_complete3, family = binomial,
## data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.03536 0.07108 0.497 0.619
## cluster_complete32 0.95741 0.08737 10.958 <2e-16 ***
## cluster_complete33 -0.07458 0.08268 -0.902 0.367
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6778.6 on 4999 degrees of freedom
## Residual deviance: 6500.2 on 4997 degrees of freedom
## AIC: 6506.2
##
## Number of Fisher Scoring iterations: 4
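With cluster membership as the only predictor, each fitted probability is simply the inverse logit of the relevant coefficient sum, so every member of a cluster receives the same prediction (coefficients copied from the summary above):

```r
## Per-cluster fitted probabilities from the coefficient estimates
b0 <- 0.03536    # intercept (cluster 1 baseline)
b2 <- 0.95741    # cluster 2 offset
b3 <- -0.07458   # cluster 3 offset
p_by_cluster <- plogis(c(cluster1 = b0,
                         cluster2 = b0 + b2,
                         cluster3 = b0 + b3))
round(p_by_cluster, 3)   # ~0.509, ~0.730, ~0.490
```

Under the 0.5 cutoff this classifies all of clusters 1 and 2 as positive and all of cluster 3 as negative, which is why the confusion matrix below splits predictions 2756 versus 2244 (the cluster 3 size).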
## Predicted class based on 0.5 cutoff
prob_cluster <- predict(glm_cluster, type = "response")
## If factor levels are "0", "1", map probabilities to those labels
pred_cluster <- ifelse(prob_cluster >= 0.5, "1", "0")
pred_cluster <- factor(pred_cluster, levels = levels(df$diagnosed_diabetes))
## Confusion matrix comparing predicted vs observed
conf_mat_cluster <- confusionMatrix(
  data = pred_cluster,
  reference = df$diagnosed_diabetes
)
conf_mat_cluster
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1144 1100
## 1 920 1836
##
## Accuracy : 0.596
## 95% CI : (0.5822, 0.6096)
## No Information Rate : 0.5872
## P-Value [Acc > NIR] : 0.1057
##
## Kappa : 0.1773
##
## Mcnemar's Test P-Value : 6.814e-05
##
## Sensitivity : 0.5543
## Specificity : 0.6253
## Pos Pred Value : 0.5098
## Neg Pred Value : 0.6662
## Prevalence : 0.4128
## Detection Rate : 0.2288
## Detection Prevalence : 0.4488
## Balanced Accuracy : 0.5898
##
## 'Positive' Class : 0
##
Comment: After retaining the first seven principal components, which together explain at least 75 percent of the total variance, hierarchical clustering was performed on the PC score matrix using complete, average, and single linkage methods. The dendrograms for each linkage show visibly different clustering structures, reflecting how each method defines inter-cluster distance. When the trees were cut to form three clusters, the cluster sizes varied substantially depending on the linkage method. Complete linkage produced a relatively balanced distribution across the three clusters (792, 1964, and 2244 individuals), suggesting that it separated groups based on overall dissimilarity across all members. In contrast, average linkage yielded one extremely large cluster (n = 4997) and two very small clusters (n = 2 and n = 1), and single linkage exhibited the same pattern (n = 4998, 1, and 1), indicating severe chaining and poor separation of meaningful groups. These results show that complete linkage is the only method that forms interpretable and non-degenerate clusters in this dataset.
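The chaining failure mode of single linkage is easy to reproduce: with one far-away point, nearest-neighbor merging absorbs everything else into a single chain and leaves the outlier as a singleton. A toy sketch, unrelated to the LE8 data:

```r
## Single linkage on 50 Gaussian points plus one distant outlier:
## the outlier is split off alone, everything else chains together
set.seed(1)
x <- rbind(matrix(rnorm(100), ncol = 2), c(10, 10))
cl_single <- cutree(hclust(dist(x), method = "single"), k = 2)
table(cl_single)   # cluster sizes 50 and 1
```

This mirrors the degenerate 4998/1/1 split seen above; complete linkage, by merging on the farthest pair, resists this behavior and tends to produce more balanced groups.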
Using the complete-linkage solution with three clusters, the mean profiles of the variables revealed clear behavioral and cardiometabolic distinctions across clusters. Cluster 1 showed the highest physical activity levels (about 179 minutes per week), relatively favorable diet scores, and intermediate BMI and blood pressure, suggesting a more physically active and moderately healthy group. Cluster 2 had the lowest physical activity (around 91 minutes), the lowest diet scores, and the highest BMI, systolic and diastolic blood pressure, and triglycerides, representing a higher metabolic risk profile. Cluster 3 was younger on average, with physical activity levels between clusters 1 and 2, the lowest BMI, the highest HDL cholesterol, and the lowest total cholesterol and triglycerides, reflecting a comparatively healthier lipid and body composition profile. Smoking patterns also differed, with Cluster 3 having the highest proportion of never smokers and the lowest proportion of former smokers. Together, these results indicate that complete-linkage clustering separated individuals into groups that meaningfully differ in activity patterns, adiposity, lipid profiles, and related cardiometabolic characteristics.
Using cluster membership as the sole predictor in a logistic regression model resulted in modest discrimination of diagnosed diabetes. The model identified significantly higher odds of diabetes in Cluster 2 compared with Cluster 1, while Cluster 3 did not differ meaningfully from the reference group. However, this limited separation among clusters translated into relatively low predictive performance. The overall accuracy was about 0.596, only slightly above the no-information rate of 0.587, and the Kappa statistic of 0.177 indicated poor agreement beyond chance. Sensitivity and specificity were balanced but modest, and the confusion matrix showed substantial misclassification in both directions.
Compared with the supervised learning methods in Problem 1, this accuracy is somewhat lower. In Problem 1, the classification tree achieved an accuracy of about 0.57, while the random forest models reached around 0.59–0.60, and logistic regression models using individual predictors achieved similar or slightly higher levels depending on variable selection. Because cluster membership is derived from unsupervised learning and does not directly optimize prediction of diagnosed diabetes, this reduced accuracy is expected. Overall, the clustering-based model does not outperform the supervised models, reinforcing that unsupervised grouping is not an effective predictor of diabetes in this dataset.